{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Generate Custom Benchmark\n",
    "\n",
    "This notebook walks through how to generate a custom benchmark based on your data.\n",
    "\n",
    "We will be using OpenAI for our embedding model and LLM, but this can easily be switched out:\n",
    "- Various embedding functions are provided in `embedding_functions.py`\n",
    "- LLM prompts are provided in `llm_functions.py`\n",
    "\n",
    "NOTE: When switching out embedding models, you will need to make a new collection for your new embeddings. Then, embed the same documents and queries with the embedding model of your choice. \n",
    "\n",
    "Use the same golden dataset of queries when comparing embedding models on the same data.\n",
    "\n",
    "Cells that should be modified when switching out embedding models are labeled as **[Modify]**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1 Install & Import\n",
    "\n",
    "Install the necessary packages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2\n",
    "\n",
    "import chromadb\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import json\n",
    "import os\n",
    "import dotenv\n",
    "from pathlib import Path\n",
    "from datetime import datetime\n",
    "from openai import OpenAI as OpenAIClient\n",
    "from anthropic import Anthropic as AnthropicClient\n",
    "from functions.llm import *\n",
    "from functions.embed import *\n",
    "from functions.chroma import *\n",
    "from functions.evaluate import *\n",
    "from functions.visualize import *\n",
    "from functions.types import *\n",
    "\n",
    "dotenv.load_dotenv()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 Set Variables"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use pre-chunked [Chroma Docs](https://docs.trychroma.com/docs/overview/introduction) as an example. To run this notebook with your own data, uncomment the commented out lines and fill in.\n",
    "\n",
    "**[Modify]** `COLLECTION_NAME` when you change your embedding model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('data/chroma_docs.json', 'r') as f:\n",
    "    corpus = json.load(f)\n",
    "\n",
    "context = \"This is a technical support bot for Chroma, a vector database company often used by developers for building AI applications.\"\n",
    "example_queries = \"\"\"\n",
    "    how to add to a collection\n",
    "    filter by metadata\n",
    "    retrieve embeddings when querying\n",
    "    how to use openai embedding function when adding to collection\n",
    "    \"\"\"\n",
    "\n",
    "COLLECTION_NAME = \"chroma-docs-openai-large\" # change this collection name whenever you switch embedding models\n",
    "\n",
    "# Generate a Benchmark with your own data:\n",
    "\n",
    "# with open('filepath/to/your/data.json', 'r') as f:\n",
    "#     corpus = json.load(f)\n",
    "\n",
    "# context = \"FILL IN WITH CONTEXT RELEVANT TO YOUR USE CASE\"\n",
    "# example_queries = \"FILL IN WITH EXAMPLE QUERIES\"\n",
    "\n",
    "# COLLECTION_NAME = \"YOUR COLLECTION NAME\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 Load API Keys\n",
    "\n",
    "To use Chroma Cloud, you can sign up for a Chroma Cloud account [here](https://www.trychroma.com/) and create a new database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Embedding Model & LLM\n",
    "OPENAI_API_KEY = os.getenv(\"OPENAI_API_KEY\")\n",
    "\n",
    "# If you want to use Chroma Cloud, uncomment and fill in the following:\n",
    "# CHROMA_TENANT = \"YOUR CHROMA TENANT ID\"\n",
    "# X_CHROMA_TOKEN = \"YOUR CHROMA API KEY\"\n",
    "# DATABASE_NAME = \"YOUR CHROMA DATABASE NAME\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.3 Set Clients\n",
    "\n",
    "Initialize the clients."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chroma_client = chromadb.Client()\n",
    "\n",
    "# If you want to use Chroma Cloud, uncomment the following line:\n",
    "# chroma_client = chromadb.HttpClient(\n",
    "#   ssl=True,\n",
    "#   host='api.trychroma.com',\n",
    "#   tenant=CHROMA_TENANT,\n",
    "#   database=DATABASE_NAME,\n",
    "#   headers={\n",
    "#     'x-chroma-token': X_CHROMA_TOKEN\n",
    "#   }\n",
    "# )\n",
    "\n",
    "openai_client = OpenAIClient(api_key=OPENAI_API_KEY)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Create Chroma Collection\n",
    "\n",
    "If you already have a Chroma Collection for your data, skip to **2.3**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Load in Your Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "corpus_ids = list(corpus.keys())\n",
    "corpus_documents = [corpus[key] for key in corpus_ids]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Embed Data & Add to Chroma Collection\n",
    "\n",
    "Embed your documents using an embedding model of your choice. We use Openai's text-embedding-3-large here, but have other functions available in `embed.py`. You may also define your own embedding function.\n",
    "\n",
    "We use batching and multi-threading for efficiency.\n",
    "\n",
    "**[Modify]** embedding function (`openai_embed_in_batches`) to the embedding model you wish to use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corpus_embeddings = openai_embed_in_batches(\n",
    "    openai_client=openai_client,\n",
    "    texts=corpus_documents,\n",
    "    model=\"text-embedding-3-large\",\n",
    ")\n",
    "\n",
    "corpus_collection = chroma_client.get_or_create_collection(\n",
    "    name=COLLECTION_NAME,\n",
    "    metadata={\"hnsw:space\": \"cosine\"}\n",
    ")\n",
    "\n",
    "collection_add_in_batches(\n",
    "    collection=corpus_collection,\n",
    "    ids=corpus_ids,\n",
    "    texts=corpus_documents,\n",
    "    embeddings=corpus_embeddings,\n",
    ")\n",
    "\n",
    "corpus = {\n",
    "    id: {\n",
    "        'document': document,\n",
    "        'embedding': embedding\n",
    "    }\n",
    "    for id, document, embedding in zip(corpus_ids, corpus_documents, corpus_embeddings)\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corpus_collection = chroma_client.get_collection(\n",
    "    name=COLLECTION_NAME\n",
    ")\n",
    "\n",
    "corpus = get_collection_items(\n",
    "    collection=corpus_collection\n",
    ")\n",
    "\n",
    "corpus_ids = [key for key in corpus.keys()]\n",
    "corpus_documents = [corpus[key]['document'] for key in corpus_ids]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Filter Documents for Quality\n",
    "\n",
    "We begin by filtering our documents prior to query generation, this step ensures that we avoid generating queries from irrelevant or incomplete documents."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 Set Criteria\n",
    "\n",
    "We use the following criteria:\n",
    "- `relevance` checks whether the document is relevant to the specified context\n",
    "- `completeness` checks for overall quality of the document\n",
    "\n",
    "You can modify the criteria as you see fit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "relevance = f\"The document is relevant to the following context: {context}\"\n",
    "completeness = \"The document is complete, meaning that it contains useful information to answer queries and does not only serve as an introduction to the main content that users may be looking for.\"\n",
    "\n",
    "criteria = [relevance, completeness]\n",
    "criteria_labels = [\"relevance\", \"completeness\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Filter Documents\n",
    "\n",
    "We filter our documents using gpt-4o-mini. Batching functions are also available in `llm.py`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filtered_document_ids = filter_documents(\n",
    "    client=openai_client,\n",
    "    model=\"gpt-4o-mini\",\n",
    "    documents=corpus_documents,\n",
    "    ids=corpus_ids,\n",
    "    criteria=criteria,\n",
    "    criteria_labels=criteria_labels\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "passed_documents = [corpus[id]['document'] for id in filtered_document_ids]\n",
    "\n",
    "failed_document_ids = [id for id in corpus_ids if id not in filtered_document_ids]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 View Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Number of documents passed: {len(filtered_document_ids)}\")\n",
    "print(f\"Number of documents failed: {len(failed_document_ids)}\")\n",
    "print(\"-\"*80)\n",
    "print(\"Example of passed document:\")\n",
    "print(corpus[filtered_document_ids[0]]['document'])\n",
    "print(\"-\"*80)\n",
    "print(\"Example of failed document:\")\n",
    "print(corpus[failed_document_ids[0]]['document'])\n",
    "print(\"-\"*80)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Generate Golden Dataset\n",
    "\n",
    "Using our filtered documents, we can genereate a golden dataset of queries."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1 Create Custom Prompt\n",
    "\n",
    "We will use `context` and `example_queries` for query generation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 Generate Queries\n",
    "\n",
    "Generate queries with gpt-4o. Batching functions are available in `llm.py`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "golden_dataset = create_golden_dataset(\n",
    "    client=openai_client,\n",
    "    model=\"gpt-4o\",\n",
    "    documents=passed_documents,\n",
    "    ids=filtered_document_ids,\n",
    "    context=context,\n",
    "    example_queries=example_queries\n",
    ")\n",
    "\n",
    "golden_dataset.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Evaluate\n",
    "\n",
    "Now that we have our golden dataset, we will can run our evaluation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1 Prepare Inputs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "queries = golden_dataset['query'].tolist()\n",
    "ids = golden_dataset['id'].tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Embed generated queries.\n",
    "\n",
    "**[Modify]** embedding function (`openai_embed_in_batches`) to the embedding model you wish to use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_embeddings = openai_embed_in_batches(\n",
    "    openai_client=openai_client,\n",
    "    texts=queries,\n",
    "    model=\"text-embedding-3-large\"\n",
    ")\n",
    "\n",
    "query_embeddings_lookup_dict = {\n",
    "    id: QueryItem(\n",
    "        text=query,\n",
    "        embedding=embedding\n",
    "    )\n",
    "    for id, query, embedding in zip(ids, queries, query_embeddings)\n",
    "}\n",
    "\n",
    "query_embeddings_lookup = QueryLookup(lookup=query_embeddings_lookup_dict)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create our qrels (query relevance labels) dataframe. In this case, each query and its corresponding document share the same id."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "qrels = pd.DataFrame(\n",
    "    {\n",
    "        \"query-id\": ids,\n",
    "        \"corpus-id\": ids,\n",
    "        \"score\": 1\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 Run Benchmark"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processing batches: 100%|██████████| 1/1 [00:00<00:00, 18.79it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NDCG@1: 0.61364\n",
      "NDCG@3: 0.7224\n",
      "NDCG@5: 0.75956\n",
      "NDCG@10: 0.76766\n",
      "MAP@1: 0.61364\n",
      "MAP@3: 0.69697\n",
      "MAP@5: 0.71742\n",
      "MAP@10: 0.72121\n",
      "Recall@1: 0.61364\n",
      "Recall@3: 0.79545\n",
      "Recall@5: 0.88636\n",
      "Recall@10: 0.90909\n",
      "P@1: 0.61364\n",
      "P@3: 0.26515\n",
      "P@5: 0.17727\n",
      "P@10: 0.09091\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "results = run_benchmark(\n",
    "    query_embeddings_lookup=query_embeddings_lookup,\n",
    "    collection=corpus_collection,\n",
    "    qrels=qrels\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save results.\n",
    "\n",
    "This is helpful for comparison (e.g. comparing different embedding models).\n",
    "\n",
    "**[Modify]** \"model\" to the model you are using"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "timestamp = datetime.now().strftime(\"%Y-%m-%d--%H-%M-%S\")\n",
    "results_to_save = {\n",
    "    \"model\": \"text-embedding-3-large\",\n",
    "    \"results\": results\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "results_dir = Path(\"results\")\n",
    "\n",
    "with open(os.path.join(results_dir, f'{timestamp}.json'), 'w') as f:\n",
    "    json.dump(results_to_save, f)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
